Abstract. In this paper we study irreducibility in RNA structures. By RNA structure we
mean RNA secondary as well as RNA pseudoknot structures. In our analysis we shall contrast
random and minimum free energy (mfe) configurations. We compute various distributions: of
the numbers of irreducible substructures, their locations and sizes, parameterized in terms of the
maximal number of mutually crossing arcs, k - 1, and the minimal size of stacks σ. In particular,
we analyze the size of the largest irreducible substructure for random and mfe structures, which
is the key factor for the folding time of mfe configurations.



1. Introduction and background



In this paper we study irreducibility in RNA structures. Intuitively, an irreducible substructure
over a subsequence is a configuration of bonds, beginning and ending with arcs of certain stack
size, that cannot be written as a nontrivial concatenation of smaller configurations. Since the
running time of any minimum free energy (mfe) folding algorithm depends at least polynomially
(to a degree larger than one) on the sequence length, the size of the largest irreducible
substructure determines the folding time.

Let us begin by recalling some basic facts about RNA structures: an RNA structure is the helical
configuration of its primary sequence, i.e. the sequence of nucleotides A, G, U and C, together with
Watson-Crick (A-U, G-C) and (U-G) base pairs. One well-known class of RNA structures are
RNA secondary structures, pioneered three decades ago by Waterman [TTl [TÜl [ITl [2l [18] . Secondary 

Date: January, 2009. 

2000 Mathematics Subject Classification. 05A16. 

Key words and phrases. pseudoknot, singularity analysis, k-noncrossing σ-canonical (k-(nc), σ-(ca)) diagram,
k-noncrossing σ-canonical (k-(nc), σ-(ca)) RNA structure, irreducible substructure, return, largest irreducible
substructure.






EMMA Y. JIN* AND CHRISTIAN M. REIDYS*'t 



structures exhibit exclusively noncrossing bonds and are subject to specific minimum arc-length
conditions. They can readily be identified with Motzkin-paths satisfying some minimum height
and plateau-length, see Figure 1 [18]. The latter restrictions come from biophysical constraints due
to mfe loop-energy parameters and limited flexibility of bonds. It is clear from the above bijection
that irreducible substructures in RNA secondary structures are closely related to the number of
nontrivial returns, i.e. the number of non-endpoints for which the Motzkin-path meets the x-axis.
As a purely combinatorial problem this has been studied by [1, 7].
It is well-known that RNA configurations are far more complex than secondary structures: they
exhibit additional, cross-serial nucleotide interactions [12]. These interactions were observed in
natural RNA structures, as well as via comparative sequence analysis [19]. They are called
pseudoknots, see Figure 2, and widely occur in functional RNA, like for instance, eP RNA [9] as well as
ribosomal RNA. RNA pseudoknots are conserved also in the catalytic core of group I introns.
In plant viral RNAs pseudoknots mimic tRNA structure and in vitro RNA evolution [14]
experiments have produced families of RNA structures with pseudoknot motifs, when binding HIV-1
reverse transcriptase.




Figure 2. The mRNA plasmid pMU720 (IncB)-pseudoknot structure; its planar graph
(top) and diagram representation (bottom).



Combinatorially, cross-serial interactions are tantamount to crossing bonds. Therefore, RNA
pseudoknot structures have been modeled as k-noncrossing (k-(nc)) diagrams [5, 6, 10], i.e. labeled
graphs over the vertex set [n] = {1, ..., n} with degree ≤ 1. Diagrams are represented by drawing
their vertices 1, ..., n in a horizontal line and their arcs (i, j), where i < j, in the upper half-plane.
Here the degree of i refers to the number of non-horizontal arcs incident to i, i.e. the backbone
of the primary sequence is not considered. The vertices and arcs correspond to nucleotides and
Watson-Crick (A-U, G-C) and (U-G) base pairs, respectively, see Figure 3. Diagrams are
characterized via their maximum number of mutually crossing arcs, k - 1, their minimum arc-length, λ,
and their minimum stack-length, σ. A k-crossing is a set of k distinct arcs
(i_1, j_1), (i_2, j_2), ..., (i_k, j_k) with the property i_1 < i_2 < ... < i_k < j_1 < j_2 < ... < j_k.
A diagram without any k-crossings is called a k-(nc) diagram. The length of an arc (i, j) is j - i
and a stack of length σ is a sequence of "parallel" arcs of the form

((i, j), (i + 1, j - 1), ..., (i + (σ - 1), j - (σ - 1))).

Figure 3. k-(nc) diagrams: we display a 4-(nc), arc-length λ ≥ 4 and σ ≥ 1 diagram
(top), where the edge set {(1, 6), (2, 9), (4, 12)} is a 3-crossing, the arc (10, 14) has length
4 and (1, 6) has stack-length 1. Below, we display a 3-(nc), λ ≥ 5 and σ ≥ 3 (lower)
diagram, where (3, 8) has arc-length 5 and the stack ((1, 10), (2, 9), (3, 8)) has stack-length
3.
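The definitions of a k-crossing and of a stack can be checked mechanically. The following Python sketch is illustrative only: the function names are hypothetical and not taken from the paper, the search is brute force, and the example uses just the arcs named in the caption of Figure 3 (the figure itself may contain further arcs).

```python
from itertools import combinations


def is_k_crossing(arcs):
    """True if the arcs (i, j), i < j, satisfy i_1 < ... < i_k < j_1 < ... < j_k."""
    arcs = sorted(arcs)
    seq = [i for i, _ in arcs] + [j for _, j in arcs]
    return all(a < b for a, b in zip(seq, seq[1:]))


def max_crossing(arcs):
    """Largest k such that some k arcs are mutually crossing (brute force)."""
    best = 1 if arcs else 0
    for k in range(2, len(arcs) + 1):
        if any(is_k_crossing(c) for c in combinations(arcs, k)):
            best = k
        else:
            break  # no k-crossing implies no (k+1)-crossing
    return best


def stacks(arcs):
    """List of (outermost arc, stack length) for each maximal stack."""
    arcset = set(arcs)
    result = []
    for i, j in sorted(arcset):
        if (i - 1, j + 1) in arcset:
            continue  # (i, j) is not the outermost arc of its stack
        length = 1
        while (i + length, j - length) in arcset:
            length += 1
        result.append(((i, j), length))
    return result


caption_arcs = [(1, 6), (2, 9), (4, 12), (10, 14)]
print(max_crossing(caption_arcs))          # a diagram is k-(nc) if this is < k
print(stacks([(1, 10), (2, 9), (3, 8)]))
```

A diagram is k-noncrossing exactly when `max_crossing` returns a value below k; for long sequences the brute-force search would have to be replaced by something more efficient.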

A subdiagram of a k-(nc) diagram is a subgraph over a subset M ⊂ [n] of consecutive vertices
that starts with an origin and ends with a terminus of some arc. Let (i_1, ..., i_m) be a sequence of
isolated points. We call (i_1, ..., i_m) interior if and only if there exists some
arc (j_1, j_2) such that j_1 < i_1 < i_m < j_2 holds and exterior, otherwise. Any exterior sequence of
consecutive, isolated vertices is called a gap. A diagram or subdiagram is called irreducible, if it
cannot be decomposed into a (nontrivial) sequence of gaps and subdiagrams, see Figure 4. As a
result, any k-(nc) diagram can be uniquely decomposed into an alternating sequence of gaps and
irreducible subdiagrams. We call a k-(nc), σ-canonical (σ-(ca)) diagram with arc-length ≥ 4 and
stack-length ≥ σ, a k-(nc), σ-(ca) RNA structure, see Figure 3. We accordingly adopt the notions
of gap, substructure and irreducibility for RNA structures. A k-(nc), σ-(ca) RNA structure has a
return at position i if i is the endpoint of some irreducible substructure, see Figure 5. Unique
large irreducible substructures are quite common for natural RNA pseudoknot structures, see
Figure 6. The size of the largest irreducible substructure is typically very large: it contains almost
all nucleotides, see Figure 7.
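The decomposition into gaps and irreducible subdiagrams can be computed by a single left-to-right sweep. The sketch below is one possible reading of the definitions above, not the authors' code: `decompose` and `returns` are hypothetical names, and the example arcs are invented so that the block boundaries match those described in the caption of Figure 4.

```python
def decompose(n, arcs):
    """Split a diagram over vertices 1..n into gaps and irreducible blocks.

    Returns a list of ("gap", start, end) and ("irreducible", start, end).
    """
    partner = {}
    for i, j in arcs:
        partner[i] = j
        partner[j] = i
    blocks = []
    v = 1
    while v <= n:
        if v not in partner:
            # maximal run of exterior isolated vertices
            start = v
            while v <= n and v not in partner:
                v += 1
            blocks.append(("gap", start, v - 1))
        else:
            # v opens a block; extend it until every arc met so far is closed
            start, reach = v, partner[v]
            while v <= reach:
                if v in partner:
                    reach = max(reach, partner[v])
                v += 1
            blocks.append(("irreducible", start, reach))
    return blocks


def returns(n, arcs):
    """Endpoints of the irreducible blocks."""
    return [end for kind, _, end in decompose(n, arcs) if kind == "irreducible"]


# Hypothetical arcs over 20 vertices (the second block contains a crossing).
example = [(1, 6), (2, 5), (8, 15), (10, 19), (11, 14)]
print(decompose(20, example))
print(returns(20, example))
```

Interior isolated vertices are absorbed into the surrounding block, while exterior ones form gaps, which is why a block only closes once the sweep has passed the right-most terminus of all arcs opened inside it.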

The paper is organized as follows: in Section 2 we recall some combinatorial framework due to
[7]. In particular, we derive the probability generating function for the number of irreducible
substructures. We remark that the framework presented in Section 2 can be generalized to RNA
tertiary structures. In Section 3 we put these results to the test: we shall compare random and
mfe structures. We begin by observing specific deviations of the distributions of mfe 2- and 3-(nc)
structures for n = 75 and σ = 3 from that of random structures. In the rest of the section we analyze
these deviations and prove in the process (Proposition 1) a "shift"-result. The latter allows us to
understand the effect of increasing the stack-size σ on irreducibility. In Section 4 we derive simple
formulas for the probabilities of return locations, i.e. the endpoints of irreducible substructures, and
contrast random and mfe secondary and pseudoknot structures. In Section 5 we study the size of
the largest irreducible components for k-(nc), σ-(ca) RNA structures.

Figure 4. Subdiagrams, gaps and irreducibility: a diagram (top), decomposed into the
subdiagram over (1, 6), the gap 7 and the subdiagram over (8, 19) and gap 20. A gap
(middle) and an irreducible diagram over (1, 12) (bottom).

Figure 5. A 3-(nc), 3-(ca) RNA structure has returns at position 13 and 24, respectively.

Figure 6. mRNA-Eca: the irreducible pseudoknot structure of the regulatory region
of the α ribosomal protein operon.
2. Some combinatorics

Let $S_{n,j}^{(k,\sigma)}$ denote the number of k-(nc), σ-(ca) RNA structures containing exactly j irreducible
substructures and let $S_n^{(k,\sigma)} = \sum_{j \ge 0} S_{n,j}^{(k,\sigma)}$. That is, $S_n^{(k,\sigma)}$ denotes the number of k-(nc), σ-(ca)
RNA structures. The bivariate generating function of the $S_{n,j}^{(k,\sigma)}$, indexed by j, the number of
irreducible substructures, and n, the sequence length, is given by

$$\mathbf{U}_{k,\sigma}(z,u) = \sum_{n \ge 0} \sum_{j \ge 0} S_{n,j}^{(k,\sigma)}\, u^j z^n. \tag{2.1}$$

Let furthermore $\mathbf{T}_{k,\sigma}(z) = \sum_{n \ge 0} S_n^{(k,\sigma)} z^n$ and let $\mathbf{R}_{k,\sigma}(z)$ denote the generating function of irreducible
RNA structures. The following lemma [7] derives the generating function $\mathbf{U}_{k,\sigma}(z,u)$:

Lemma 1. The bivariate generating function of the number of k-(nc), σ-(ca) RNA structures,
which contain exactly j irreducible k-(nc), σ-(ca) RNA substructures, is given by

$$\mathbf{U}_{k,\sigma}(z,u) = \frac{1}{1 - z - u\,\mathbf{R}_{k,\sigma}(z)}, \qquad \mathbf{R}_{k,\sigma}(z) = 1 - z - \frac{1}{\mathbf{T}_{k,\sigma}(z)}.$$

Lemma 1 is the key for computing the limit distribution of the number of k-(nc), σ-(ca) RNA
structures that have exactly j irreducible RNA substructures. For this purpose, let $\xi_n^{(k,\sigma)}$ be the
r.v. having the probability distribution

$$P(\xi_n^{(k,\sigma)} = j) = \frac{S_{n,j}^{(k,\sigma)}}{S_n^{(k,\sigma)}}. \tag{2.2}$$
The theorem below [7] shows that the probabilities $P(\xi_n^{(k,\sigma)} = j)$ satisfy a discrete limit law.

Theorem 1. Let $\alpha_{k,\sigma}$ be the real positive dominant singularity of $\mathbf{T}_{k,\sigma}(z)$ and

$$\tau_{k,\sigma} = 1 - \frac{1}{(1 - \alpha_{k,\sigma})\,\mathbf{T}_{k,\sigma}(\alpha_{k,\sigma})}.$$

Then the r.v. $\xi_n^{(k,\sigma)}$ satisfies the discrete limit law

$$\lim_{n \to \infty} P(\xi_n^{(k,\sigma)} = i) = \frac{(1 - \tau_{k,\sigma})^2}{\tau_{k,\sigma}}\; i\, \tau_{k,\sigma}^{\,i}. \tag{2.3}$$

That is, the limit of $\xi_n^{(k,\sigma)}$ is determined by the density function of a $\Gamma(-\ln \tau_{k,\sigma}, 2)$-distribution. Furthermore,
the probability generating function of the limit distribution is given by

$$q(u) = \frac{u\,(1 - \tau_{k,\sigma})^2}{(1 - \tau_{k,\sigma} u)^2}.$$

The generating function of k-(nc), σ-(ca) RNA structures, $\mathbf{T}_{k,\sigma}(z)$, and its dominant singularities,
$\alpha_{k,\sigma}$, have been studied in [5, 6, 10]. In particular the limiting probability of irreducible RNA
structures is given by

$$\lim_{n \to \infty} P(\xi_n^{(k,\sigma)} = 1) = (1 - \tau_{k,\sigma})^2. \tag{2.4}$$

We observe that for fixed σ and increasing crossing number, k, the parameter $\tau_{k,\sigma}$ decreases.
Therefore the limiting probability of RNA structures to be irreducible increases with increasing
crossing number. However, for fixed k and increasing σ, the parameter $\tau_{k,\sigma}$ increases.
Consequently, the limiting probability of RNA structures to be irreducible decreases with increasing σ.
Theorem 1 allows us to compute the characteristic function of the limit r.v. $\xi^{(k,\sigma)}$:

$$E[e^{it\xi^{(k,\sigma)}}] = \frac{e^{it}\,(1 - \tau_{k,\sigma})^2}{(1 - \tau_{k,\sigma} e^{it})^2}.$$

By Taylor expansion of the characteristic function we obtain the moments of $\xi^{(k,\sigma)}$, i.e.,

$$E[e^{it\xi^{(k,\sigma)}}] = 1 + (it)\,E[\xi^{(k,\sigma)}] + \frac{(it)^2}{2!}\,E[(\xi^{(k,\sigma)})^2] + \dots + \frac{(it)^m}{m!}\,E[(\xi^{(k,\sigma)})^m] + o(t^m). \tag{2.5}$$

Consequently, we can compute expectation and variance of $\xi^{(k,\sigma)}$ for varying k and σ, see
Table 1 and Table 2. Table 1 shows that for RNA secondary structures increasing the stack size σ
significantly increases reducibility. Table 2 indicates that 3-(nc) RNA pseudoknot structures are
typically irreducible with rather subtle dependence on the minimum stack-size, σ.
Table 1. k = 2

          τ_{k,σ}    E[ξ^(k,σ)]    V[ξ^(k,σ)]
σ = 3     0.3201     1.9416        5.1548
σ = 4     0.3441     2.0492        5.7991
σ = 5     0.3615     2.1323        6.3203

Table 2. k = 3

          τ_{k,σ}    E[ξ^(k,σ)]    V[ξ^(k,σ)]
σ = 3     0.0167     1.0340        1.1036
σ = 4     0.0208     1.0425        1.1302
σ = 5     0.0244     1.0500        1.1538
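The tabulated values can be related to the limit law by a short computation. The sketch below assumes that the limit law (2.3) has the form P(ξ = i) = (1 - τ)² i τ^(i-1), which agrees with eq. (2.4) for i = 1; the closed forms for mean and variance follow from standard power-series identities and are not quoted from the paper. All function names are hypothetical.

```python
def limit_pmf(tau, i):
    """P(xi = i) under the assumed limit law (1 - tau)^2 * i * tau^(i - 1)."""
    return (1 - tau) ** 2 * i * tau ** (i - 1)


def limit_mean(tau):
    """E[xi] = (1 + tau) / (1 - tau)."""
    return (1 + tau) / (1 - tau)


def limit_variance(tau):
    """V[xi] = 2 * tau / (1 - tau)^2."""
    return 2 * tau / (1 - tau) ** 2


def limit_second_moment(tau):
    """E[xi^2] = V[xi] + E[xi]^2."""
    return limit_variance(tau) + limit_mean(tau) ** 2


for tau in (0.3201, 0.3441, 0.3615, 0.0167, 0.0208, 0.0244):
    print(tau, round(limit_mean(tau), 4), round(limit_variance(tau), 4),
          round(limit_second_moment(tau), 4))
```

With the tabulated τ values the mean reproduces the second column of Tables 1 and 2. The third column (for instance 5.1548 for k = 2, σ = 3) agrees numerically with the second moment E[ξ²] rather than with 2τ/(1 - τ)², so under this reading of (2.3) that column may be reporting E[ξ²]; this is an observation from the arithmetic only.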



3. RNA random structures and RNA mfe structures

In this section we analyze irreducibility in random and mfe RNA secondary and pseudoknot
structures. As folding algorithms for the generation of the mfe RNA secondary and pseudoknot
structures we employ Vienna RNA [15] and cross [3]. We shall begin by comparing in Figure 8
the irreducibility of 2-(nc) and 3-(nc) random and mfe structures (of length n = 75) for minimum
stack size σ = 3. Figure 8 shows that: (a) for 2-(nc), 3-(ca) structures the mfe structures are more
irreducible than their random counterparts and (b) for 3-(nc) structures the contrary is
observed: 3-(nc) mfe structures are less irreducible than 3-(nc) random structures.




Figure 8. Random versus mfe: the lhs shows the distribution of irreducibles in 2-(nc),
3-(ca) mfe (red) and random structures (blue), for n = 75. The rhs showcases these
distributions for 3-(nc), 3-(ca) mfe (red) and random structures (blue) for n = 75.

In order to understand the above observations, let us proceed by analyzing first the effect of
increasing the minimum stack-size σ for 2- and 3-(nc) mfe structures. While we have shown in
Section 2 that the limit distribution shifts towards less irreducibility when increasing σ, Figure 9
shows that for fixed σ, 2-(nc) as well as 3-(nc) mfe structures become less irreducible when the
sequence length n increases. Intuitively, the increase in σ for fixed n implies that any irreducible
substructure has to become larger. Therefore, in light of the fact that there are only a few
irreducible substructures, disallowing these small irreducible substructures implies the shift
towards irreducibility. With this picture in mind, we shall proceed by quantifying this phenomenon:
let

$$R_{k,\sigma}(n) = [z^n]\,\mathbf{R}_{k,\sigma}(z), \qquad \beta_{k,\sigma,j}(n) = [z^n]\,\mathbf{R}_{k,\sigma}(z)^j \Big(\frac{1}{1 - z}\Big)^{j+1}, \qquad \text{where } \mathbf{R}_{k,\sigma}(z) = 1 - z - \frac{1}{\mathbf{T}_{k,\sigma}(z)},$$

i.e. $R_{k,\sigma}(n)$ denotes the number of irreducible structures over n nucleotides and $\beta_{k,\sigma,j}(n)$ denotes
the number of k-(nc), σ-(ca) RNA structures with exactly j irreducible substructures. Then,
clearly, $R_{k,\sigma}(n) \le \beta_{k,\sigma,1}(n)$. We shall prove that, for sufficiently large n, the scaling factor needed
for passing from σ to σ + 1 is $\ln(\alpha_{k,\sigma}) / \ln(\alpha_{k,\sigma+1})$.

Figure 9. Finite size effect: here we insert the distribution of irreducibles of 2-(nc),
3-(ca) mfe structures for n = 85 (magenta) into Figure 8.

Proposition 1. For sufficiently large n and arbitrary a , we have 



(3.1) RkA^ 
Furthermore, Pk.a,jin) satisfies 

(3.2) (3k,a+l,j 



( hi{ak.a) 



\\n{ak.a+i 

\n[ak,a) 



R. 



■k,a+l 



_lll{ak,a+l) 



ln{ak,a) 

lli{ak,a+l) 

f3k,a,]{n). 



In order to illustrate Proposition 1 we consider 2-(nc), 3-(ca) and 2-(nc), 4-(ca) structures.
According to eq. (7.3)

$$x_{75}^{(2)} = \Big\lfloor n \Big( \frac{\ln(\alpha_{k,\sigma})}{\ln(\alpha_{k,\sigma+1})} - 1 \Big) \Big\rfloor \Big|_{n=75,\,k=2,\,\sigma=3} = \Big\lfloor 75 \Big( \frac{\ln(0.6053)}{\ln(0.6504)} - 1 \Big) \Big\rfloor = 12$$

and for 3-(nc), 3-(ca) and 3-(nc), 4-(ca) structures we obtain

$$x_{75}^{(3)} = \Big\lfloor n \Big( \frac{\ln(\alpha_{k,\sigma})}{\ln(\alpha_{k,\sigma+1})} - 1 \Big) \Big\rfloor \Big|_{n=75,\,k=3,\,\sigma=3} = \Big\lfloor 75 \Big( \frac{\ln(0.4914)}{\ln(0.5587)} - 1 \Big) \Big\rfloor = 16.$$

Figure 10 shows how well the "shifting" works for 2- and 3-(nc), 3-(ca) mfe RNA structures.
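The shifts quoted above follow from eq. (7.3) by direct evaluation. The sketch below assumes that the result is rounded down to an integer, an assumption that reproduces the quoted values 12 and 16 as well as the value 48 used later in this section; the function name `shift` is hypothetical.

```python
import math


def shift(n, alpha_from, alpha_to):
    """Sequence-length shift n * (ln(alpha_from) / ln(alpha_to) - 1), rounded down."""
    return math.floor(n * (math.log(alpha_from) / math.log(alpha_to) - 1))


print(shift(75, 0.6053, 0.6504))  # 2-(nc): sigma = 3 -> 4
print(shift(75, 0.4914, 0.5587))  # 3-(nc): sigma = 3 -> 4
print(shift(75, 0.4369, 0.6053))  # 2-(nc): sigma = 1 -> 3
```

Before rounding, the three values are roughly 12.5, 16.5 and 48.7, so the comparison lengths n = 85 and n = 90 used in Figure 10 are close to, but not exactly, 75 plus the shift.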




Figure 10. Proposition 1 at work: the lhs shows 2-(nc), 3-(ca) mfe structures for
n = 75 (red) and 2-(nc), 4-(ca) mfe structures for n = 85 (magenta). The rhs displays
3-(nc), 3-(ca) mfe structures for n = 75 (red) and 3-(nc), 4-(ca) mfe structures for n = 90
(black).



Proposition 1 brings us now in the position to understand observation (a): the discrepancy of the
irreducibility of 2-(nc), 3-(ca) mfe structures of length 75 and random structures. To this end we
compare in Figure 11 the irreducibility of 2-(nc), 1-(ca) mfe structures [15] for n = 75 with that
of random structures. In view of

$$\Delta_{75} = \Big\lfloor 75 \Big( \frac{\ln(\alpha_{2,1})}{\ln(\alpha_{2,3})} - 1 \Big) \Big\rfloor = \Big\lfloor 75 \Big( \frac{\ln(0.4369)}{\ln(0.6053)} - 1 \Big) \Big\rfloor = 48,$$

the former correspond to 2-(nc), 3-(ca) structures of length 75 + 48 = 123. Accordingly,
Proposition 1 and Figure 11 imply that 2-(nc), 3-(ca) mfe structures of length 123 exhibit an almost
identical distribution as random structures of infinite sequence length. Finally, we remark upon the
"paradox" that Proposition 1, being entirely based on the limit distribution of random structures,
allows us to obtain information about mfe structures of length 75.



As for observation (b), recall that 3-(nc) structures consist of two distinct combinatorial classes:
2-(nc) and 2-crossing RNA structures. The combinatorics of 2- and 3-(nc) RNA structures [10]
implies that the number of 2-crossings is exponentially larger than the number of 2-(nc) RNA
structures, see Table 3. To be concrete, the ratio of 2-(nc) over 2-crossing random 3-(nc) structures
for n = 75 is ≈ 6 × 10^{-6}, while the ratio of 2-(nc) versus 2-crossing RNA structures generated by cross
is ≈ 1.7027. In other words, cross overrepresents 2-(nc) structures at a rate of approximately
300 000 : 1. In Figure 12 we illustrate the fact that 3-(nc), 3-(ca) mfe structures are more similar
to 2-(nc) than to 3-(nc) random structures.

Figure 11. The distributions of irreducibles of 2-(nc), 1-(ca) mfe (red) and random
structures (blue) for n = 75.

          k = 2     k = 3     k = 4     k = 5     k = 6     k = 7     k = 8     k = 9
σ = 3     1.6521    2.0348    2.2644    2.4432    2.5932    2.7243    2.8414    2.9480
σ = 4     1.5375    1.7898    1.9370    2.0488    2.1407    2.2198    2.2896    2.3523
σ = 5     1.4613    1.6465    1.7532    1.8330    1.8979    1.9532    2.0016    2.0449
σ = 6     1.4063    1.5515    1.6345    1.6960    1.7457    1.7877    1.8243    1.8569

Table 3. The exponential growth rates of k-(nc), σ-(ca) RNA structures, where σ ≥ 3.
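The quoted overrepresentation factor can be retraced with a few lines of arithmetic. The sketch below uses the 63% share of noncrossing mfe configurations mentioned in Section 6 and takes the random ratio as about 6 × 10^{-6}; the exponent is an assumption where the source is hard to read, chosen because it is the value consistent with the quoted factor of roughly 300 000 : 1. It also checks that the growth rates of Table 3 are the reciprocals of the dominant singularities used elsewhere in this section. All names are hypothetical.

```python
def odds(fraction):
    """Ratio of a share to its complement, e.g. noncrossing versus crossing."""
    return fraction / (1 - fraction)


def growth_rate(alpha):
    """Exponential growth rate 1 / alpha for a dominant singularity alpha."""
    return 1 / alpha


mfe_ratio = odds(0.63)        # share of noncrossing mfe structures for n = 75
random_ratio = 6e-6           # assumed ratio among random 3-(nc) structures
factor = mfe_ratio / random_ratio

print(round(mfe_ratio, 4))    # compare with the quoted 1.7027
print(round(factor))          # compare with the quoted 300 000 : 1
print(round(growth_rate(0.6053), 4), round(growth_rate(0.4914), 4))
```

The factor comes out slightly below 300 000, which is compatible with the rounded figures given in the text.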
Figure 12. 2- and 3-(nc) random structures (blue/black) versus 3-(nc) mfe structures
(magenta) for n = 85. The lhs and rhs display these curves for σ = 3 and σ = 4,
respectively.

4. The distribution of returns 

In this section we study the distribution of returns, i.e. the endpoint locations of irreducible
substructures in RNA random and RNA mfe structures. In other words, we compute the probability
for a particular position to be the endpoint of an irreducible substructure. Let χ(s) denote the set of
returns of a given structure s. Clearly, for each return at i there exists an irreducible substructure
starting at j + 1 and ending at i. Accordingly, a structure decomposes into 3 distinct segments: the
first being an arbitrary substructure over [1, j], the second being an irreducible substructure
over [j + 1, i] and the third being an arbitrary substructure over [i + 1, n], see Figure 13. We denote
the n-th coefficient of the generating function $\mathbf{R}_{k,\sigma}(z)$ by $R_{k,\sigma}(n)$, i.e., $R_{k,\sigma}(n)$ is the number of
irreducible k-(nc), σ-(ca) RNA structures over length n. Let $T_{k,\sigma}(n)$ denote the n-th coefficient of $\mathbf{T}_{k,\sigma}(z)$,
i.e., $T_{k,\sigma}(n)$ is the number of k-(nc), σ-(ca) RNA structures over length n. Finally, let $a_i(n)$ denote
the number of RNA structures of length n containing i as a return.

Figure 13. Three distinct segments: the first, [1, j], and the last, [i + 1, n], each contain an
arbitrary RNA structure, while the second, [j + 1, i], contains an irreducible
RNA structure.
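Proposition 2. Let k ≥ 2 and σ ≥ 1 be natural numbers, μ = (k - 1)² + (k - 1)/2 and let $\alpha_{k,\sigma}$ denote
the unique dominant singularity of $\mathbf{T}_{k,\sigma}(z)$. Then

$$P[i \in \chi(s)] \sim \Big(1 - \frac{i}{n}\Big)^{-\mu} \Big[ i^{-\mu} - (i - 1)^{-\mu} \alpha_{k,\sigma} \Big]$$

and in particular for i → ∞ and (n - i) → ∞

$$\frac{P[i + 1 \in \chi(s)]}{P[i \in \chi(s)]} \sim \Big( \frac{n - i}{n - i - 1} \Big)^{\mu}.$$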



Proposition 2 implies that returns are most likely to occur at the end of the sequence, see Figure 14.
Furthermore the probability for the occurrence at the end of the sequence is exponential with
exponent μ = (k - 1)² + (k - 1)/2. Consequently, larger crossing numbers imply that "later"
returns become more likely.
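The three-segment decomposition behind this statement can be tested on a toy model. The sketch below is not the k-(nc), σ-(ca) class of the paper: it uses all noncrossing diagrams without arc-length or stack constraints, which are counted by the Motzkin numbers, and it relies on the identity [z^i](T(z)R(z)) = T(i) - T(i - 1), which follows from R(z) = 1 - z - 1/T(z). The formula a_i(n) = T(n - i)(T(i) - T(i - 1)) is then compared with a brute-force enumeration. All function names are hypothetical.

```python
def motzkin(n_max):
    """T[0..n_max]: number of noncrossing diagrams over n vertices."""
    T = [1] * (n_max + 1)
    for n in range(2, n_max + 1):
        # vertex 1 is isolated, or paired with vertex p (2 <= p <= n)
        T[n] = T[n - 1] + sum(T[p - 2] * T[n - p] for p in range(2, n + 1))
    return T


def return_counts(n):
    """a_i(n) for i = 1..n via a_i(n) = T(n - i) * (T(i) - T(i - 1))."""
    T = motzkin(n)
    return [T[n - i] * (T[i] - T[i - 1]) for i in range(1, n + 1)]


def matchings(lo, hi):
    """Enumerate all noncrossing diagrams over the vertices lo..hi."""
    if lo > hi:
        yield []
        return
    yield from matchings(lo + 1, hi)              # lo is isolated
    for p in range(lo + 1, hi + 1):               # lo is paired with p
        for inner in matchings(lo + 1, p - 1):
            for outer in matchings(p + 1, hi):
                yield [(lo, p)] + inner + outer


def brute_force_return_counts(n):
    """Count, for each position, the diagrams having a return there."""
    counts = [0] * n
    for arcs in matchings(1, n):
        partner = dict(arcs)
        v = 1
        while v <= n:
            if v in partner:          # origin of an outermost arc
                v = partner[v]        # its terminus is a return
                counts[v - 1] += 1
            v += 1
    return counts


print(return_counts(6))
print(brute_force_return_counts(6))
```

In this toy model the last position already carries the largest count for n = 6, in line with the qualitative statement that returns concentrate at the end of the sequence.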




Figure 14. The distribution of returns for mfe and random structures of finite size:
the lhs shows 2-(nc), 3-(ca), mfe and random structures of length 75 (red/blue). The rhs
displays 3-(nc), 3-(ca), mfe and random structures of length 85 (magenta/blue).



5. Irreducible substructures 



In this section we compute the distribution of sizes of large irreducible substructures. In Section 5
we established that an irreducible substructure is typically "large"; in the following we shall prove
substantial improvements: the size of the largest irreducible substructure is of size at least n - O(1).
Section 3 and Section 4 show that RNA structures typically decompose into either one or two
irreducible components. We therefore restrict our analysis to these two scenarios. We shall begin
by studying the size, $x_n$, of a unique irreducible RNA substructure.

Lemma 2. Suppose an RNA structure contains a unique irreducible substructure, s, then

$$P(|s| = x_n \mid s \text{ is unique}) \sim \frac{(n - x_n + 1)\, c \cdot x_n^{-\mu}\, \alpha_{k,\sigma}^{-x_n}}{c \cdot \big(\frac{1}{1 - \alpha_{k,\sigma}}\big)^2\, n^{-\mu}\, \alpha_{k,\sigma}^{-n}}. \tag{5.1}$$

In particular, any unique irreducible substructure has size of at least n - O(1).
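Furthermore for

$$1 \ll x_n < n - \frac{\alpha_{k,\sigma}}{1 - \alpha_{k,\sigma}}, \tag{5.2}$$

the conditional probability of having a unique irreducible substructure s of size $x_n$ is strictly
monotone in $x_n$ and for $x_n = n - \frac{\alpha_{k,\sigma}}{1 - \alpha_{k,\sigma}}$ given by

$$\lim_{n \to \infty} P\Big( |s| = n - \frac{\alpha_{k,\sigma}}{1 - \alpha_{k,\sigma}} \,\Big|\, s \text{ is unique} \Big) = (1 - \alpha_{k,\sigma})\, \alpha_{k,\sigma}^{\frac{\alpha_{k,\sigma}}{1 - \alpha_{k,\sigma}}}. \tag{5.3}$$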


Lemma 2 and the above observations show that a unique largest irreducible component is of size
almost n. The few remaining unpaired nucleotides are of size O(1), see Figure 7. The random
structure distributions of length 85 derived in Lemma 2 are given in Figure 15 together with the
distributions for 2- and 3-(nc), 3-(ca) RNA mfe structures of length n = 85. As for random
structures, we observe that in this case the dominant singularities are $\alpha_{2,3} = 0.6053$ and
$\alpha_{3,3} = 0.4914$ and the maximal probabilities as specified in eq. (5.3) are given by

$$(1 - \alpha_{2,3})\, \alpha_{2,3}^{\frac{\alpha_{2,3}}{1 - \alpha_{2,3}}} \approx 0.18, \qquad (1 - \alpha_{3,3})\, \alpha_{3,3}^{\frac{\alpha_{3,3}}{1 - \alpha_{3,3}}} \approx 0.25,$$

respectively. The distributions of 2- and 3-(nc), 3-(ca) mfe structures exhibit similar features as
those of random structures. Since the data on mfe structures are obtained by sampling random
sequences having a unique irreducible substructure, they represent a refinement of the data given
in Figure 14. Next we consider the case of two irreducible substructures. As in the case of a unique
irreducible substructure, here the larger of the two irreducibles contains almost all nucleotides.
The proofs, however, become substantially more involved.
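The size profile of a unique irreducible substructure can be explored numerically. The sketch below assumes that, for fixed m = n - x_n and growing n, the ratio in eq. (5.1) tends to (m + 1)(1 - α)² α^m (the factor (x_n/n)^(-μ) tends to 1), and that eq. (5.3) evaluates this profile at m = α/(1 - α), the point beyond which the profile decreases in m. Both are readings of the lemma made for illustration, and the function names are hypothetical.

```python
def unique_giant_profile(alpha, m):
    """Assumed limit of eq. (5.1) for fixed m = n - x_n: (m + 1)(1 - alpha)^2 alpha^m."""
    return (m + 1) * (1 - alpha) ** 2 * alpha ** m


def peak_value(alpha):
    """Profile evaluated at m = alpha / (1 - alpha): (1 - alpha) * alpha^(alpha / (1 - alpha))."""
    return (1 - alpha) * alpha ** (alpha / (1 - alpha))


for alpha in (0.6053, 0.4914):
    profile = [round(unique_giant_profile(alpha, m), 4) for m in range(5)]
    print(alpha, round(peak_value(alpha), 4), profile)
```

For α = 0.6053 and α = 0.4914 the value of `peak_value` is close to the quoted 0.18 and 0.25, and the profile decays geometrically once m exceeds α/(1 - α), which is the quantitative content of the statement that the unique giant has size n - O(1).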
Figure 15. The distribution of the size of the unique irreducible component for mfe
and random structures of length 85: the lhs displays 2-(nc), 3-(ca) mfe (red) and random
structures (blue), the rhs shows 3-(nc), 3-(ca) mfe (red) and random structures (blue).

Figure 16. The distribution of the sizes of the giant for mfe and random structures
having exactly 2 irreducible components. The lhs: 2-(nc), 3-(ca) mfe structures (red)
and random structures for n = 85 and 100 (blue/black). The rhs: 3-(nc), 3-(ca) mfe
structures (red) and random structures for n = 85 and 100 (blue/black).



Lemma 3. Suppose we are given an RNA structure S that contains exactly the two irreducible
substructures $s_1$ and $s_2$, where $x_n = |s_1| \ge |s_2|$. Then

$$P(|s_1| = x_n \mid S \text{ contains } s_1, s_2) = \begin{cases} o(1) & \text{for } (n - x_n) \to \infty \\ 2(1 - \alpha_{k,\sigma})^3 \cdot c_a \cdot (\alpha_{k,\sigma})^a & \text{for } (n - x_n) \to a < \infty, \end{cases} \tag{5.4}$$

where μ = (k - 1)² + (k - 1)/2, $c_a > 0$ and a ≥ 1. In particular, in the limit of long sequences, $s_1$
has a.s. a size of at least (n - O(1)).

According to Lemma 3 we have for any (n - x_n) → a < ∞

$$\lim_{n \to \infty} P(|s_1| = x_n \mid S \text{ contains } s_1, s_2) = 2(1 - \alpha_{k,\sigma})^3 \cdot c_a \cdot (\alpha_{k,\sigma})^a. \tag{5.5}$$

It follows from the proof of Lemma 3 in Section 7 that $c_a = \sum_{t \le a} \binom{a - t + 2}{2} R_{k,\sigma}(t)$. Let â be

a positive constant for which $c_a \cdot (\alpha_{k,\sigma})^a$ is maximal. Consequently the probability $P(|s_1| = x_n \mid
S \text{ contains } s_1, s_2)$ is maximal at $x_n = n - â$, implying that the size of the largest irreducible
component is, in the limit of long sequences, typically n - â, with the probability given by eq. (5.5) at a = â.
Note that for 3-(nc), 3-(ca) random structures of length n, we have $\alpha_{3,3} = 0.4914$. Maximizing
the term $c_a \cdot (\alpha_{k,\sigma})^a$ yields â = 14 and accordingly the size of the largest component is likely to be
n - 14. Figure 16 confirms that already for n = 85, the size of the largest irreducible component is
typically 70. Figure 16 shows first that the probability $P(|s_1| = x_n \mid S \text{ contains } s_1, s_2)$ is sharply
concentrated at n - â, as implied by Lemma 3. Second, as n increases, the distribution of
component sizes shifts into a limit distribution which is sharply concentrated at n - â. Furthermore
we remark that, by construction, â is independent of n, see Figure 17. For n = 75 and n = 85, the
size of the largest irreducible component is localized at $x_{75} = 60$ and $x_{85} = 70$.




Figure 17. (n - x_n) is independent of n: we display the distribution of the sizes of the
giants conditional on the existence of two irreducibles. The lhs shows 2-(nc), 3-(ca)
random structures for n = 75 and the rhs for n = 85, respectively. In both distributions
we highlight the distance (red) between the typical size of the giant and the end of the
sequence.
6. Discussion



Let us integrate our results and put them into context. Employing the Motzkin-path (Figure 1)
interpretation, it is straightforward to construct random RNA secondary structures. However,
random RNA pseudoknot structures are a different matter. Their inherent cross-serial dependencies
(Figure 2) prohibit constructive recurrence relations and despite their D-finiteness [5], at the present
time, there exists no computer algorithm that can construct a random RNA pseudoknot structure in
polynomial time with uniform probability. Consequently, any data on limit distributions of random
RNA pseudoknot structures are nontrivial and virtually impossible to obtain computationally.
In this paper we have shown that random RNA structures decompose into a small number of
irreducible substructures, see Figures 2, 6 and 7. We established in Section 5 that one of these
irreducibles is in fact a "giant", i.e. it contains almost all nucleotides. Key structural parameters,
like the maximum number of mutually crossing bonds, as well as the minimum stack size do
not fundamentally change this picture, see Figure [H and Figure [TTl. In Section 3 we discussed
the distribution of random structures and mfe structures. We actually used the limit of long
sequences in order to prove a shift-result, allowing for the reduction of k-(nc), σ-(ca) structures
over n to k-(nc), (σ - j)-(ca) structures over n - f(j). Figure 10 illustrates that the original and
the correspondingly shifted distributions of mfe structures virtually "coincide". Mfe and random
structures exhibit significant differences. Most striking maybe is the vast preference of noncrossing
RNA pseudoknot structures over their crossing counterparts. While the percentage of 63% of folded
noncrossing configurations for n = 75 does not seem to be particularly remarkable, Theorem 1 of
Section 2 shows that the above percentage is equivalent to a factor of 300 000 : 1, relative to
random sampling. This is certainly a consequence of the currently implemented pseudoknot-loop
energy parameters. At this point it is pure speculation whether or not different energy
parameters or significantly longer sequence length will alter this picture. In [4] the reader can
find more data on the fraction of noncrossing configurations of 3-(nc) mfe structures. In any case,
the overrepresentation of noncrossing configurations implies that the distributions of 3-(nc) mfe
structures are in fact more similar to those of random RNA secondary structures.

Having established that, even in the limit of long sequences, only a few irreducibles exist, the next
question is to determine their respective sizes. In Section 4 and Section 5 we achieve this by studying
returns, that is the endpoints of irreducible substructures, and the distribution of their sizes in
Lemma 2 and Lemma 3. For random structures we observe that the largest irreducible component
is a giant as it contains almost all nucleotides. For mfe structures we observe a systematic shift
towards smaller sizes of the giant, however a giant irreducible substructure typically also exists
in mfe structures. Aside from these structural results we present in Section 5, eq. (5.3), a simple
formula for identifying the typical size of a unique giant and confirm in Figure 15 its applicability
to mfe structures. Along these lines we furthermore localize the typical size of the giant in case of
two irreducible substructures. In addition we make its dependence on k, σ and n explicit.

7. Proofs 

Proof of Proposition 1

Proof. D-finiteness [13] of $\mathbf{R}_{k,\sigma}(z)$ guarantees the existence of analytic continuation in some simply
connected domain containing zero around the dominant singularity $\alpha_{k,\sigma}$. Therefore the singular
expansion of $\mathbf{R}_{k,\sigma}(z)$ at its dominant singularity $\alpha_{k,\sigma}$ exists and in case of k ≡ 1 mod 2 we have,
setting μ = (k - 1)² + (k - 1)/2,

$$\mathbf{R}_{k,\sigma}(z) = \tau_k - c_k \Big(1 - \frac{z}{\alpha_{k,\sigma}}\Big)^{\mu - 1} \ln\Big(1 - \frac{z}{\alpha_{k,\sigma}}\Big) (1 + o(1)), \quad \text{where } c_k > 0. \tag{7.1}$$

In case of k ≡ 0 mod 2, the singular expansion is given by

$$\mathbf{R}_{k,\sigma}(z) = \tau_k - c_k \Big(1 - \frac{z}{\alpha_{k,\sigma}}\Big)^{\mu - 1} (1 + o(1)), \quad \text{where } c_k > 0. \tag{7.2}$$

In the following we restrict ourselves to the analysis of the case k ≡ 1 mod 2. The arguments for
k ≡ 0 mod 2 are completely analogous. From the singular expansion of $\mathbf{R}_{k,\sigma}(z)$ we derive

$$R_{k,\sigma}(n) \sim n^{-\mu} \Big( \frac{1}{\alpha_{k,\sigma}} \Big)^n$$

and, apparently, $R_{k,\sigma+1}(n) < R_{k,\sigma}(n)$. For sufficiently large n, we have
Suppose $\lim_{n \to \infty} R_{k,\sigma}(n) / R_{k,\sigma+1}(n + x_n^{(k,\sigma)}) = c > 0$, then



(7.3) 

Accordingly 

and eq. p.ip foUows. Using the singular expansion of Rfc ^(-z) given in eq. (|7.ip we derive 



-Rfe.cr+i("- + a;!'''') / ln(afe,cr) 



/?fc,.j(n) = [z^]RkA^Y 



1- z 



Tfc - Cfc 1 



ak.a 



In 1 



Qífc,, 



(1 + 0(1)) 



1 y+^ 



c!"'^^ ■ n ^a. for some constant cV'^ > 

k.cr k-.a k.a 



.(1) 



and conscqucntly 



í^k.a+i.jin + y„,fc) 4,i+i ■ (" + ^'^fc,"+i"''' foi" '^o™'^ constant c^^^^ > 0. 

We observe that Pk.a.jin) w Pk.a+i.jin + Un.k) holds only iiyn.k = [cfe '«J for some constant Cfc > 
with thc propcrty 



(7.5) 



Cüfc,, 



\ _ ^k.a 

1+Ck ] ^ (1) 



\{0íkM+l) / C 

The solution Cfc of cq. (|7.5p is asymptotically 

ln(afc^£r) 



(1+Cfc)^ 



Cfc 



ln(afc^cr+i) ' 



whence Un.k = x'rf' and eq. (j3.2p is established 



□ 



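Proof of Proposition 2

Proof. We begin by splitting a given structure s at j and i:

$$a_i(n) = \sum_{j < i} T_{k,\sigma}(j) \cdot R_{k,\sigma}(i - j) \cdot T_{k,\sigma}(n - i) = T_{k,\sigma}(n - i) \cdot \sum_{j < i} T_{k,\sigma}(j) \cdot R_{k,\sigma}(i - j)$$

and $\sum_{j < i} T_{k,\sigma}(j) \cdot R_{k,\sigma}(i - j) = [z^i]\,(\mathbf{T}_{k,\sigma}(z)\mathbf{R}_{k,\sigma}(z))$. Furthermore, we derive

$$\sum_{i=1}^{n} a_i(n) = \sum_{i=1}^{n} T_{k,\sigma}(n - i) \cdot [z^i]\,(\mathbf{T}_{k,\sigma}(z)\mathbf{R}_{k,\sigma}(z)) = \sum_{i=1}^{n} [z^{n-i}]\,\mathbf{T}_{k,\sigma}(z) \cdot [z^i]\,(\mathbf{T}_{k,\sigma}(z)\mathbf{R}_{k,\sigma}(z)) = [z^n]\,\mathbf{T}_{k,\sigma}^2(z)\mathbf{R}_{k,\sigma}(z).$$

Consequently, the probability of a return at position i is given by

$$P[i \in \chi(s)] = \frac{T_{k,\sigma}(n - i) \cdot [z^i]\,(\mathbf{T}_{k,\sigma}(z)\mathbf{R}_{k,\sigma}(z))}{[z^n]\,\mathbf{T}_{k,\sigma}^2(z)\mathbf{R}_{k,\sigma}(z)}. \tag{7.6}$$

Using $\mathbf{R}_{k,\sigma}(z) = 1 - z - \frac{1}{\mathbf{T}_{k,\sigma}(z)}$ we rewrite the probability $P[i \in \chi(s)]$ as

$$P[i \in \chi(s)] = \frac{T_{k,\sigma}(n - i) \cdot [z^i]\,((1 - z)\mathbf{T}_{k,\sigma}(z) - 1)}{[z^n]\,((1 - z)\mathbf{T}_{k,\sigma}^2(z) - \mathbf{T}_{k,\sigma}(z))}. \tag{7.7}$$

We next use the singular expansion of $\mathbf{T}_{k,\sigma}(z)$ at the dominant singularity $\alpha_{k,\sigma}$,

$$\mathbf{T}_{k,\sigma}(z) = \begin{cases} O\Big( \big(1 - \frac{z}{\alpha_{k,\sigma}}\big)^{\mu - 1} \ln\big(1 - \frac{z}{\alpha_{k,\sigma}}\big) \Big) & \text{for } k \text{ odd as } z \to \alpha_{k,\sigma} \\ O\Big( \big(1 - \frac{z}{\alpha_{k,\sigma}}\big)^{\mu - 1} \Big) & \text{for } k \text{ even as } z \to \alpha_{k,\sigma}, \end{cases} \tag{7.8}$$

where μ = (k - 1)² + (k - 1)/2. We restrict ourselves to proving the case of k ≡ 1 mod 2, the
case k ≡ 0 mod 2 follows analogously. Since D-finite power series form an algebra,
$\mathbf{Q}_{k,\sigma}(z) = (1 - z)\mathbf{T}_{k,\sigma}^2(z) - \mathbf{T}_{k,\sigma}(z)$ is D-finite and analytic continuation and singular expansion exist. The
latter is given by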



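$$\mathbf{Q}_{k,\sigma}(z) = (1 - z)\, O\Big( \big(1 - \tfrac{z}{\alpha_{k,\sigma}}\big)^{\mu - 1} \ln\big(1 - \tfrac{z}{\alpha_{k,\sigma}}\big) \Big) - O\Big( \big(1 - \tfrac{z}{\alpha_{k,\sigma}}\big)^{\mu - 1} \ln\big(1 - \tfrac{z}{\alpha_{k,\sigma}}\big) \Big) = O\Big( \big(1 - \tfrac{z}{\alpha_{k,\sigma}}\big)^{\mu - 1} \ln\big(1 - \tfrac{z}{\alpha_{k,\sigma}}\big) \Big), \quad z \to \alpha_{k,\sigma}.$$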

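Therefore, extracting the coefficients of the singular expansion, we obtain

\[
[z^n]\,\Omega_{k,\sigma}(z) \sim n^{-\mu}\,\alpha_{k,\sigma}^{-n}. \tag{7.9}
\]

Using $[z^i]T_{k,\sigma}(z) = T(i) \sim i^{-\mu}\alpha_{k,\sigma}^{-i}$ as $i \to \infty$ we arrive at

\[
\mathbb{P}[i \in x(s)] = \frac{T(n-i)\left[[z^i]T_{k,\sigma}(z) - [z^{i-1}]T_{k,\sigma}(z)\right]}{[z^n]\,\Omega_{k,\sigma}(z)}
\sim \left(1 - \frac{i}{n}\right)^{-\mu}\left[i^{-\mu} - (i-1)^{-\mu}\alpha_{k,\sigma}\right].
\]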

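From this we immediately conclude in case of $i \to \infty$ and $(n-i) \to \infty$

\[
\frac{\mathbb{P}[i+1 \in x(s)]}{\mathbb{P}[i \in x(s)]}
\sim \frac{\left(1 - \frac{i+1}{n}\right)^{-\mu}\left[(i+1)^{-\mu} - i^{-\mu}\alpha_{k,\sigma}\right]}{\left(1 - \frac{i}{n}\right)^{-\mu}\left[i^{-\mu} - (i-1)^{-\mu}\alpha_{k,\sigma}\right]}
\sim \left(\frac{n-i}{n-i-1}\right)^{\mu}, \qquad i \to \infty,\ (n-i) \to \infty.
\]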



Proof of Lemma 2.



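Proof. Using the singular expansion of $R_{k,\sigma}(z)$, eq. (7.1) and eq. (7.2) we obtain

\[
\delta_{n,1} = [z^n]\left(\frac{1}{1-z}\right)^{2} R_{k,\sigma}(z) \sim c\cdot\left(\frac{1}{1-\alpha_{k,\sigma}}\right)^{2} n^{-\mu}\,\alpha_{k,\sigma}^{-n}. \tag{7.10}
\]

We proceed by computing

\[
\delta_{n,1,x_n} = [z^{x_n}]R_{k,\sigma}(z)\cdot [z^{n-x_n}]\left(\frac{1}{1-z}\right)^{2} \sim (n - x_n + 1)\,c\cdot x_n^{-\mu}\,\alpha_{k,\sigma}^{-x_n} \tag{7.11}
\]

and combining eq. (7.10) and eq. (7.11), we arrive at

\[
\frac{\delta_{n,1,x_n}}{\delta_{n,1}} \sim \frac{(n - x_n + 1)\,c\cdot x_n^{-\mu}\,\alpha_{k,\sigma}^{-x_n}}{c\cdot\left(\frac{1}{1-\alpha_{k,\sigma}}\right)^{2} n^{-\mu}\,\alpha_{k,\sigma}^{-n}}. \tag{7.12}
\]

The critical term here in eq. (7.12) is readily identified to be $\alpha_{k,\sigma}^{\,n-x_n}$ and consequently

\[
\lim_{n\to\infty} \frac{\delta_{n,1,x_n}}{\delta_{n,1}} > 0 \quad\Longrightarrow\quad x_n = n - O(1), \tag{7.13}
\]

whence the lemma.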
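The concentration at $x_n = n - O(1)$ can be illustrated numerically. The sketch below is an assumption-laden toy model, not the paper's computation: it replaces $[z^x]R_{k,\sigma}(z)$ by the bare asymptotic shape $x^{-\mu}\alpha^{-x}$ (constants dropped), takes $\alpha = 0.5$ and $\mu = 1.5$ as arbitrary sample values, and uses hypothetical function names. Setting $x_n = n - a$ in eq. (7.12) suggests the limit $(a+1)(1-\alpha)^2\alpha^a$.

```python
def largest_component_ratios(n, alpha, mu, a_max):
    """Ratio delta_{n,1,x} / delta_{n,1} for x = n - a, a = 0..a_max, with the
    model weights R(x) = x**(-mu) * alpha**(-x) (an assumption).
    All weights are rescaled by alpha**n to avoid floating-point overflow."""
    def term(x):
        # R(x) * [z^(n-x)] (1/(1-z))^2, where the second factor equals n - x + 1
        return (n - x + 1) * x ** (-mu) * alpha ** (n - x)
    total = sum(term(x) for x in range(1, n + 1))
    return [term(n - a) / total for a in range(a_max + 1)]

def limit_law(alpha, a):
    """Limit (a + 1) (1 - alpha)^2 alpha^a read off from eq. (7.12) with x_n = n - a."""
    return (a + 1) * (1 - alpha) ** 2 * alpha ** a

ratios = largest_component_ratios(n=1000, alpha=0.5, mu=1.5, a_max=5)
for a, q in enumerate(ratios):
    print(a, round(q, 4), round(limit_law(0.5, a), 4))
```

In this model the limiting weights sum to one over $a \ge 0$, since $\sum_{a\ge 0}(a+1)\alpha^a = (1-\alpha)^{-2}$, which is consistent with all of the mass sitting at $x_n = n - O(1)$.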
Proof of Lemma 3.



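Proof. We distinguish the cases $x_n \le \frac{n}{2}$ and $x_n > \frac{n}{2}$. In case of $x_n \le \frac{n}{2}$ we have

\[
\delta_{n,2} = [z^n]\left(\frac{1}{1-z}\right)^{3} R_{k,\sigma}^{2}(z) \sim c\cdot\left(\frac{1}{1-\alpha_{k,\sigma}}\right)^{3} n^{-\mu}\,\alpha_{k,\sigma}^{-n}, \qquad c > 0. \tag{7.14}
\]

Therefore the number of structures containing $s_1$ of size $x_n$ is given by

\[
\begin{aligned}
\delta_{n,2,x_n} &= \sum_{\max(s,t) = x_n} [z^{s}]R_{k,\sigma}(z)\cdot [z^{t}]R_{k,\sigma}(z)\cdot [z^{n-s-t}]\left(\frac{1}{1-z}\right)^{3} \\
&= 2\,[z^{x_n}]R_{k,\sigma}(z)\cdot \sum_{i=1}^{x_n} [z^{i}]R_{k,\sigma}(z)\cdot [z^{n-x_n-i}]\left(\frac{1}{1-z}\right)^{3}.
\end{aligned}
\]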



The term $[z^{n-x_n-i}]\left(\frac{1}{1-z}\right)^{3}$ represents the number of ways of writing the integer $u = n - x_n - i$
as an ordered sum of 3 nonnegative parts, denoted by $P(u, 3)$. Assuming the first part to be $i$, ranging from $0$
to $u$, the number of ways of dividing $(u - i)$ into 2 ordered nonnegative parts is $(u - i + 1)$, whence

\[
[z^{u}]\left(\frac{1}{1-z}\right)^{3} = \sum_{i=0}^{u} P(u-i, 2) = \sum_{i=0}^{u} (u - i + 1) = \binom{u+2}{2}.
\]
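The counting identity $[z^u](1-z)^{-3} = \binom{u+2}{2}$ is easy to confirm by brute force. The following minimal sketch enumerates ordered triples of nonnegative integers directly and compares them with the sum over the first part; the function names are illustrative only.

```python
from itertools import product
from math import comb

def weak_compositions_3(u):
    """Count ordered triples (a, b, c) of nonnegative integers with a + b + c = u.
    The third part c = u - a - b is determined once a + b <= u."""
    return sum(1 for a, b in product(range(u + 1), repeat=2) if a + b <= u)

def by_first_part(u):
    """Condition on the first part i = 0..u; the remainder u - i splits
    into two ordered nonnegative parts in (u - i + 1) ways."""
    return sum(u - i + 1 for i in range(u + 1))

for u in range(8):
    print(u, weak_compositions_3(u), by_first_part(u), comb(u + 2, 2))
```

All three columns agree for every $u$, which is the identity used to pass from the coefficient of $(1-z)^{-3}$ to the binomial coefficient in eq. (7.15).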



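Consequently, we can rewrite $\delta_{n,2,x_n}$ as

\[
\delta_{n,2,x_n} = 2\,[z^{x_n}]R_{k,\sigma}(z)\cdot \sum_{i=1}^{x_n} \binom{n - x_n - i + 2}{2}\cdot [z^{i}]R_{k,\sigma}(z). \tag{7.15}
\]

Claim. Suppose $x_n$ and $n - x_n$ tend to infinity, as $n$ tends to infinity. Then there exists some
constant $\kappa > 0$ such that

\[
\sum_{i=1}^{x_n} \binom{n - x_n - i + 2}{2}\, [z^{i}]R_{k,\sigma}(z) = \kappa\cdot \binom{n - 2x_n + 2}{2}\, [z^{x_n}]R_{k,\sigma}(z). \tag{7.16}
\]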

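According to the Claim

\[
\begin{aligned}
\delta_{n,2,x_n} &= 2\,[z^{x_n}]R_{k,\sigma}(z)\cdot \sum_{i=1}^{x_n} \binom{n - x_n - i + 2}{2}\cdot [z^{i}]R_{k,\sigma}(z) \\
&= 2\,[z^{x_n}]R_{k,\sigma}(z)\cdot \kappa \binom{n - 2x_n + 2}{2}\, [z^{x_n}]R_{k,\sigma}(z) \\
&\sim 2\kappa\, c^{2}\cdot x_n^{-2\mu}\,\alpha_{k,\sigma}^{-2x_n} \binom{n - 2x_n + 2}{2},
\end{aligned}
\]

whence the probability of containing the largest component of size $x_n \le \frac{n}{2}$ is given by

\[
\frac{\delta_{n,2,x_n}}{\delta_{n,2}} \sim \frac{2\kappa\, c^{2}\cdot x_n^{-2\mu}\,\alpha_{k,\sigma}^{-2x_n} \binom{n - 2x_n + 2}{2}}{c\cdot\left(\frac{1}{1-\alpha_{k,\sigma}}\right)^{3} n^{-\mu}\,\alpha_{k,\sigma}^{-n}} = o(1). \tag{7.17}
\]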



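In case of $x_n > \frac{n}{2}$ we derive

\[
\begin{aligned}
\delta_{n,2,x_n} &= 2\,[z^{x_n}]R_{k,\sigma}(z)\cdot \sum_{i=1}^{n-x_n} [z^{i}]R_{k,\sigma}(z)\cdot [z^{n-x_n-i}]\left(\frac{1}{1-z}\right)^{3} \\
&= 2\,[z^{x_n}]R_{k,\sigma}(z)\cdot [z^{n-x_n}]\,R_{k,\sigma}(z)\left(\frac{1}{1-z}\right)^{3}.
\end{aligned}
\]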



Suppose $(n - x_n) \to \infty$, then the singular expansion implies

\[
[z^{n-x_n}]\,R_{k,\sigma}(z)\left(\frac{1}{1-z}\right)^{3} \sim \kappa\cdot (n - x_n)^{-\mu}\,\alpha_{k,\sigma}^{-(n-x_n)}.
\]



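Accordingly, we derive the following asymptotic expression for $\delta_{n,2,x_n}$:

\[
\begin{aligned}
\delta_{n,2,x_n} &= 2\,[z^{x_n}]R_{k,\sigma}(z)\cdot [z^{n-x_n}]\,R_{k,\sigma}(z)\left(\frac{1}{1-z}\right)^{3} \\
&\sim 2c\kappa\cdot x_n^{-\mu}\,\alpha_{k,\sigma}^{-x_n}\,(n - x_n)^{-\mu}\,\alpha_{k,\sigma}^{-(n-x_n)}.
\end{aligned}
\]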

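Therefore we arrive at

\[
\frac{\delta_{n,2,x_n}}{\delta_{n,2}} \sim \frac{2c\kappa\cdot x_n^{-\mu}\,\alpha_{k,\sigma}^{-x_n}\,(n - x_n)^{-\mu}\,\alpha_{k,\sigma}^{-(n-x_n)}}{c\cdot\left(\frac{1}{1-\alpha_{k,\sigma}}\right)^{3} n^{-\mu}\,\alpha_{k,\sigma}^{-n}}. \tag{7.18}
\]

Note that $(n - x_n) \to \infty$ implies that $x_n = \nu n$ for some $\frac{1}{2} < \nu < 1$. Consequently

\[
\frac{\delta_{n,2,x_n}}{\delta_{n,2}} \sim 2\kappa(1 - \alpha_{k,\sigma})^{3}\, \frac{x_n^{-\mu}(n - x_n)^{-\mu}}{n^{-\mu}}
= 2\kappa(1 - \alpha_{k,\sigma})^{3} \left[x_n\left(1 - \frac{x_n}{n}\right)\right]^{-\mu}
= 2\kappa(1 - \alpha_{k,\sigma})^{3} \left[\nu\cdot n(1 - \nu)\right]^{-\mu} = o(1), \tag{7.19}
\]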



from which we immediately conclude

\[
\lim_{n\to\infty} \frac{\delta_{n,2,x_n}}{\delta_{n,2}} = 0 \qquad \text{for } x_n > \frac{n}{2} \text{ and } (n - x_n) \to \infty. \tag{7.20}
\]



In case of $(n - x_n) \to a < \infty$, we set

\[
C_a = \sum_{i=1}^{a} \binom{a - i + 2}{2} R_{k,\sigma}(i) = [z^{a}]\,R_{k,\sigma}(z)\left(\frac{1}{1-z}\right)^{3}. \tag{7.21}
\]






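Accordingly

\[
\delta_{n,2,x_n} = 2\,[z^{x_n}]R_{k,\sigma}(z)\cdot [z^{n-x_n}]\,R_{k,\sigma}(z)\left(\frac{1}{1-z}\right)^{3}
\sim 2C_a\cdot c\cdot x_n^{-\mu}\,\alpha_{k,\sigma}^{-x_n}.
\]

We accordingly arrive at

\[
\frac{\delta_{n,2,x_n}}{\delta_{n,2}} \sim \frac{2C_a\cdot c\cdot x_n^{-\mu}\,\alpha_{k,\sigma}^{-x_n}}{c\cdot\left(\frac{1}{1-\alpha_{k,\sigma}}\right)^{3} n^{-\mu}\,\alpha_{k,\sigma}^{-n}}
\sim 2C_a\cdot (1 - \alpha_{k,\sigma})^{3}\cdot \alpha_{k,\sigma}^{\,a}. \tag{7.22}
\]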



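We observe that $x_n \le \frac{n}{2}$ implies $(n - x_n) \to \infty$, therefore we conclude in the case $(n - x_n) \to \infty$

\[
\lim_{n\to\infty} \frac{\delta_{n,2,x_n}}{\delta_{n,2}} = 0.
\]

While $(n - x_n) \to a < \infty$ implies $x_n > \frac{n}{2}$, we conclude that in the case $(n - x_n) \to a < \infty$,

\[
\lim_{n\to\infty} \frac{\delta_{n,2,x_n}}{\delta_{n,2}} = 2C_a\cdot (1 - \alpha_{k,\sigma})^{3}\cdot \alpha_{k,\sigma}^{\,a},
\]

completing the proof of the lemma.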



Proof of the Claim. 



Proof. Set

\[
A_i = \binom{n - x_n - i + 2}{2} [z^{i}]R_{k,\sigma}(z) = \binom{n - x_n - i + 2}{2} R_{k,\sigma}(i). \tag{7.23}
\]



We first show that $A_{x_n}$ is the maximal term $A_i$ for $1 \le i \le x_n$. In view of the fact that $x_n \le \frac{n}{2}$,

\[
\lim_{n\to\infty} \frac{A_{i+1}}{A_i} = \lim_{n\to\infty} \frac{n - x_n - i}{n - x_n - i + 2}\cdot \frac{R_{k,\sigma}(i+1)}{R_{k,\sigma}(i)}
\]

1 Rk,a{t+l) 



= hm 

n — >cx 

> hm 



+ Rk.a{t) 
1 Rk,a{Í + l) 



"'^'^l + ^ Rk,a{i) 

2 ' 

Rk.a{l + l) ^ 
Rk,a{Í) 
Therefore $A_{x_n}$ is maximal. We shall show next

\[
\forall\; 0 < a < 1; \qquad \sum_{i < x_n - n^{a}} A_i = o(1)\cdot A_{x_n}. \tag{7.24}
\]



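In view of $\sum_{i < x_n - n^{a}} A_i \le (x_n - n^{a})\,A_{x_n - n^{a}}$, we obtain

\[
\sum_{i < x_n - n^{a}} A_i \le (x_n - n^{a}) \binom{n - 2x_n + n^{a} + 2}{2} R_{k,\sigma}(x_n - n^{a}).
\]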



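Using $R_{k,\sigma}(n) \sim c_1 n^{-\mu}\alpha_{k,\sigma}^{-n}$, for some $c_1 > 0$, we arrive at

\[
\begin{aligned}
\frac{\sum_{i < x_n - n^{a}} A_i}{A_{x_n}} &\le \frac{(x_n - n^{a}) \binom{n - 2x_n + n^{a} + 2}{2} R_{k,\sigma}(x_n - n^{a})}{\binom{n - 2x_n + 2}{2} R_{k,\sigma}(x_n)} \\
&\sim (x_n - n^{a}) \left(1 + \frac{n^{a}}{n - 2x_n + 2}\right)\left(1 + \frac{n^{a}}{n - 2x_n + 1}\right) \left(\frac{x_n - n^{a}}{x_n}\right)^{-\mu} \alpha_{k,\sigma}^{\,n^{a}},
\end{aligned}
\]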

whence eq. (7.24). Next we claim that the terms close to $x_n$ contribute at most $O(A_{x_n})$:

\[
\sum_{j=0}^{n^{a}} A_{x_n - j} = O(A_{x_n}). \tag{7.25}
\]



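To prove this, we compute for $1 \le j \le n^{a}$

\[
\begin{aligned}
\frac{A_{x_n - j}}{A_{x_n}} &= \left(1 + \frac{j}{n - 2x_n + 2}\right)\left(1 + \frac{j}{n - 2x_n + 1}\right) \frac{R_{k,\sigma}(x_n - j)}{R_{k,\sigma}(x_n)} \\
&\sim \left(1 + \frac{j}{n - 2x_n + 2}\right)\left(1 + \frac{j}{n - 2x_n + 1}\right) \alpha_{k,\sigma}^{\,j} \\
&\le \left(1 + \frac{j}{2}\right)(1 + j)\,\alpha_{k,\sigma}^{\,j}.
\end{aligned}
\]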



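Taking the sum over all $j$ we obtain

\[
\begin{aligned}
\sum_{j=0}^{n^{a}} \frac{A_{x_n - j}}{A_{x_n}} &\le \sum_{j=0}^{n^{a}} \left(1 + \frac{j}{2}\right)(1 + j)\,\alpha_{k,\sigma}^{\,j} \\
&= \frac{\alpha_{k,\sigma}^{\,n^{a}+1}\left[(n^{a}+1)^{2}(\alpha_{k,\sigma} - 1)^{2} + (n^{a}+1)(\alpha_{k,\sigma} - 2)^{2} - n^{a} + 1\right] - 2}{2(\alpha_{k,\sigma} - 1)^{3}},
\end{aligned}
\]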



whence

\[
\lim_{n\to\infty} \sum_{j=0}^{n^{a}} \frac{A_{x_n - j}}{A_{x_n}} \le \frac{1}{(1 - \alpha_{k,\sigma})^{3}}.
\]
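The closed form of the finite sum and its limit can be verified with exact rational arithmetic. This is a small sketch in which `alpha` is an arbitrary sample value in $(0,1)$ rather than one of the paper's singularities, `N` plays the role of $n^{a}$, and the function names are illustrative.

```python
from fractions import Fraction

def partial_sum(alpha, N):
    """Direct evaluation of sum_{j=0}^{N} (1 + j/2)(1 + j) alpha^j."""
    return sum(Fraction(j + 1) * Fraction(j + 2, 2) * alpha ** j for j in range(N + 1))

def closed_form(alpha, N):
    """Closed form with M = N + 1:
    (alpha^M [M^2 (alpha-1)^2 + M (alpha-2)^2 - N + 1] - 2) / (2 (alpha-1)^3)."""
    M = N + 1
    bracket = M ** 2 * (alpha - 1) ** 2 + M * (alpha - 2) ** 2 - N + 1
    return (alpha ** M * bracket - 2) / (2 * (alpha - 1) ** 3)

alpha = Fraction(1, 3)            # toy value in (0, 1), chosen for illustration
limit = 1 / (1 - alpha) ** 3      # value of the infinite series

for N in (0, 1, 5, 20):
    print(N, partial_sum(alpha, N) == closed_form(alpha, N), float(limit - partial_sum(alpha, N)))
```

Because every summand is positive, the partial sums increase towards $(1-\alpha)^{-3}$, which is why the bound holds for every $n$ and not only in the limit.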
Therefore, we obtain $\sum_{j=0}^{n^{a}} A_{x_n - j} \le \frac{1}{(1 - \alpha_{k,\sigma})^{3}}\, A_{x_n}$ and we arrive at

\[
\sum_{i=1}^{x_n} A_i = \sum_{i < x_n - n^{a}} A_i + \sum_{i \ge x_n - n^{a}} A_i = o(A_{x_n}) + O(A_{x_n}) = \kappa\cdot A_{x_n}, \tag{7.26}
\]

for some constant $\kappa > 0$, proving the Claim. □